ABSTRACT Pairwise sequence comparison methods have
been assessed using proteins whose relationships are known
reliably from their structures and functions, as described in
the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T.
& Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation
tested the programs BLAST [Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol.
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996)
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. &
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448],
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol.
Biol. 147, 195-197] and their scoring schemes. The error rate
of all algorithms is greatly reduced by using statistical scores
to evaluate matches rather than percentage identity or raw
scores. The E-value statistical scores of SSEARCH and FASTA are
reliable: the number of false positives found in our tests agrees
well with the scores reported. However, the P-values reported
by BLAST and WU-BLAST2 exaggerate significance by orders of
magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform
best, and they are capable of detecting almost all relationships
between proteins whose sequence identities are >30%. For
more distantly related proteins, they do much less well; only
one-half of the relationships between proteins with 20-30%
identity are found. Because many homologs have low sequence
similarity, most distant relationships cannot be detected by
any pairwise comparison method; however, those which are
identified may be used with confidence.

Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreting the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative
capabilities of different procedures are largely unknown. It is
difficult to verify algorithms on sample data because this
requires large data sets of proteins whose evolutionary relationships
are known unambiguously and independently of the
methods being evaluated. However, nearly all known homologs
have been identified by sequence analysis (the method
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that al- 
though previous evaluations have helped improve sequence 
comparison, they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however these 
complementary goals are linked such that increasing one 
causes the other to be reduced. 

The publication costs of this article were defrayed in part by page charge
payment. This article must therefore be hereby marked "advertisement" in
accordance with 18 U.S.C. §1734 solely to indicate this fact.

PNAS is available online at http://www.pnas.org.



Sequence comparison methodologies have evolved rapidly,
so no previously published test has evaluated modern versions
of the programs in common use. For example, parameters in
BLAST (1) have changed, and WU-BLAST2 (2), which produces
gapped alignments, has become available. The latest version
of FASTA (3) provides fundamentally different results in the
form of statistical scoring.

The previous reports also have left gaps in our knowledge.
For example, there has been no published assessment of
thresholds for scoring schemes more sophisticated than percentage
identity. Thus, the widely discussed statistical scoring
measures have never actually been evaluated on large databases
of real proteins. Moreover, the different scoring schemes
commonly in use have not been compared.

Beyond these issues, there is a more fundamental question:
in an absolute sense, how well does pairwise sequence comparison
work? That is, what fraction of homologous proteins
can be detected using modern database searching methods?

In this work, we attempt to answer these questions and to
overcome both of the fundamental difficulties that have hindered
the assessment of sequence comparison methodologies.
First, we use the set of distant evolutionary relationships in the
SCOP: Structural Classification of Proteins database (4), which
is derived from structural and functional characteristics (5).
This database provides a set of homologs, which are known
independently of sequence comparison. Second, we use an
assessment method that jointly measures both sensitivity and
specificity. This method allows straightforward comparison of
different sequence searching procedures. Further, it can be
used to aid interpretation of real database searches and thus
provide optimal and reliable results.

Previous Assessments of Sequence Comparison. Several
previous studies have examined the relative performance of
different sequence comparison methods. The most encompassing
analyses have been by Pearson (6, 7), who compared
the three most commonly used programs. Of these, the
Smith-Waterman algorithm (8) implemented in SSEARCH (3) is the
oldest and slowest but the most rigorous. Modern heuristics
have provided BLAST (1) the speed and convenience to make
it the most popular program. Intermediate between these two
is FASTA (3), which may be run in two modes offering either
greater speed (ktup = 2) or greater effectiveness (ktup = 1).
Pearson also considered different parameters for each of these
programs.

To test the methods, Pearson selected two representative
proteins from each of 67 protein superfamilies defined by the
PIR database (9). Each was used as a query to search the
database, and the matched proteins were marked as being
homologous or unrelated according to their membership of PIR

Abbreviation: EPQ, errors per query.

Present address: Department of Structural Biology, Stanford University,
Fairchild Building D-109, Stanford, CA 94305-5126.
likely, its power can be attributed to its incorporation of more
information than any other measure; it takes account of the
full substitution and gap data (like raw scores) but also has
details about the sequence lengths and composition and is
scaled appropriately.

We find that statistical scores are not only powerful, but also
easy to interpret. SSEARCH and FASTA show close agreement
between statistical scores and actual number of errors per
query (Fig. 4). The expectation value score gives a good,
slightly conservative estimate of the chances of the two sequences
being found at random in a given query. Thus, an
E-value of 0.01 indicates that roughly one pair of nonhomologs
of this similarity should be found in every 100 different queries.
Neither raw scores nor percentage identity can be interpreted
in this way, and these results validate the suitability of the
extreme value distribution for describing the scores from a
database search.
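
The arithmetic behind this reading of an E-value can be sketched in a
few lines of Python. This is only an illustrative sketch: the function
names and the example counts are hypothetical and are not taken from
any of the programs assessed here.

```python
def expected_false_positives(e_value: float, n_queries: int) -> float:
    """Expected number of nonhomologous matches scoring at least this well
    when n_queries separate database searches are run."""
    return e_value * n_queries


def errors_per_query(n_false_positives: int, n_queries: int) -> float:
    """Observed errors per query (EPQ) over a set of searches."""
    return n_false_positives / n_queries


def calibration_ratio(observed_epq: float, reported_e_value: float) -> float:
    """A ratio near 1 means the reported E-value matches the observed error
    rate; a ratio well above 1 means significance is being overstated."""
    return observed_epq / reported_e_value


if __name__ == "__main__":
    # Hypothetical numbers: 100 queries searched at an E-value cutoff of 0.01.
    print(expected_false_positives(0.01, 100))
    # Hypothetical numbers: 13 false positives observed over 1,000 queries.
    print(calibration_ratio(errors_per_query(13, 1000), 0.01))
```

A ratio close to 1 corresponds to the close agreement described above,
whereas a score that overstates significance by two orders of magnitude
would give a ratio above 100.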

The P-values from BLAST also should be directly interpretable
but were found to overstate significance by more than two
orders of magnitude for 1% EPQ for this database. Nonetheless,
these results strongly suggest that the analytic theory is
fundamentally appropriate. WU-BLAST2 scores were more reliable
than those from BLAST, but also exaggerate expected
confidence by more than an order of magnitude at 1% EPQ.
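
One way to see how a P-value and an E-value relate is the Poisson
relation P = 1 - exp(-E) that is commonly used with extreme value
statistics. The text here does not state this formula, so it is an
assumption of the sketch below, and the function names are
hypothetical.

```python
import math


def p_from_e(e_value: float) -> float:
    """Probability of at least one chance match, assuming the number of
    chance matches is Poisson distributed with mean e_value (assumption)."""
    return -math.expm1(-e_value)


def e_from_p(p_value: float) -> float:
    """Inverse of p_from_e; valid for 0 <= p_value < 1."""
    return -math.log1p(-p_value)


if __name__ == "__main__":
    for e in (0.001, 0.01, 0.1, 1.0, 10.0):
        print(f"E = {e:<6} P = {p_from_e(e):.6f}")
```

Under this assumption the two scores are nearly identical below about
0.01 and separate at larger values, because a probability cannot
exceed 1 while an expected count can.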

Overall Detection of Homologs and Comparison of Algorithms.
The results in Fig. 5A and Table 1 show that pairwise
sequence comparison is capable of identifying only a small
fraction of the homologous pairs of sequences in PDB40D-B.
Even SSEARCH with E-values, the best protocol tested, could
find only 18% of all relationships at a 1% EPQ. BLAST, which
identifies 15%, was the worst performer, whereas FASTA
ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and
WU-BLAST2 are intermediate in their ability to detect homologs.
Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. SSEARCH is 25 times slower than BLAST and 6.5 times
slower than FASTA ktup = 1. WU-BLAST2 is slightly faster than
FASTA ktup = 2, but the latter has more interpretable scores.

In PDB90D-B, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is WU-BLAST2. Consequently, we infer that the
differences between the FASTA ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with
variation in database composition and scoring reliability.

Fig. 6 helps to explain why most distant homologs cannot be
found by sequence comparison: a great many such relationships
have no more sequence identity than would be expected
by chance. SSEARCH with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there
are 30 pairs of homologous proteins that do not have significant
E-values, but 26 of these involve sequences with <50
residues. Of sequences having 25-30% identity, 75% are
identified by SSEARCH E-values. However, although the number
of homologs grows at lower levels of identity, the detection
falls off sharply: only 40% of homologs with 20-25% identity
are detected, and only 10% of those with 15-20% can be found.
These results show that statistical scores can find related
proteins whose identity is remarkably low; however, the power ...

Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to
their identity. Most structurally identified homologs in the database
have diverged extremely far in sequence and have <20% identity. Note
that the alignments may be inaccurate, especially at low levels of
identity. Most relationships that have 25% or more identity can be
identified, but detection wanes below 25%. Consequently, the sequence
divergence of most structurally identified evolutionary relationships
effectively defeats the ability of pairwise sequence comparison to
detect them.

After completion of this work, a new version of pairwise
BLAST was released: BLASTGP (37). It supports gapped alignments,
like WU-BLAST2, and dispenses with sum statistics. Our
initial tests on BLASTGP using default parameters show that its
E-values are reliable and that its overall detection of homologs
was substantially better than that of ungapped BLAST, but not
quite equal to that of WU-BLAST2.

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25,
and references therein) suggests that the most effective sequence
searches are made by (i) using a large current database
in which the protein sequences have been complexity masked
and (ii) using statistical scores to interpret the results. Our
experiments fully support this view.

Our results also suggest two further points. First, the E-values
reported by FASTA and SSEARCH give fairly accurate
estimates of the significance of each match, but the P-values
provided by BLAST and WU-BLAST2 underestimate the true
extent of errors. Second, SSEARCH, WU-BLAST2, and FASTA
ktup = 1 perform best, though BLAST and FASTA ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

The homologous proteins that are found by sequence comparison
can be distinguished with high reliability from the huge
number of unrelated pairs. However, even the best database
searching procedures tested fail to find the large majority of
distant evolutionary relationships at an acceptable error rate.
Thus, if the procedures assessed here fail to find a reliable
match, it does not imply that the sequence is unique; rather, it
indicates that any relatives it might have are distant ones.**

**Additional and updated information about this work, including
supplementary figures, may be found at http://sss.stanford.edu/sss/.

The authors are grateful to Drs. A. G. Murzin, M. Levitt, S. R. Eddy,
and G. Mitchison for valuable discussion. S.E.B. was principally
supported by a St. John's College (Cambridge, UK) Benefactors'
Scholarship and by the American Friends of Cambridge University.

S.E.B. dedicates his contribution to the memory of Rabbi Albert T.
and Clara S. Bilgray.

1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman,
D. J. (1990) J. Mol. Biol. 215, 403-410.
2. Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480.
3. Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA
85, 2444-2448.
4. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995)
J. Mol. Biol. 247, 536-540.
5. Brenner, S. E., Chothia, C., Hubbard, T. J. P. & Murzin, A. G.
(1996) Methods Enzymol. 266, 635-643.
6. Pearson, W. R. (1991) Genomics 11, 635-650.
7. Pearson, W. R. (1995) Protein Sci. 4, 1145-1160.
8. Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197.
9. George, D. G., Hunt, L. T. & Barker, W. C. (1996) Methods
Enzymol. 266, 41-59.
10. Vogt, G., Etzold, T. & Argos, P. (1995) J. Mol. Biol. 249, 816-831.
11. Henikoff, S. & Henikoff, J. G. (1993) Proteins 17, 49-61.
12. Bairoch, A. & Apweiler, R. (1996) Nucleic Acids Res. 24, 21-25.
13. Bairoch, A., Bucher, P. & Hofmann, K. (1996) Nucleic Acids Res.
24, 189-196.
14. Henikoff, S. & Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. USA
89, 10915-10919.
15. Dayhoff, M., Schwartz, R. M. & Orcutt, B. C. (1978) in Atlas of
Protein Sequence and Structure, ed. Dayhoff, M. (National Biomedical
Research Foundation, Silver Spring, MD), Vol. 5, Suppl. 3,
pp. 345-352.
16. Brenner, S. E. (1996) Ph.D. thesis (University of Cambridge, UK).
17. Sander, C. & Schneider, R. (1991) Proteins 9, 56-68.
18. Johnson, M. S. & Overington, J. P. (1993) J. Mol. Biol. 233,
716-738.
19. Barton, G. J. & Sternberg, M. J. E. (1987) Protein Eng. 1, 89-94.
20. Lesk, A. M., Levitt, M. & Chothia, C. (1986) Protein Eng. 1, 77-78.
21. Arratia, R., Gordon, L. & Waterman, M. S. (1986) Ann. Stat. 14,
971-993.
22. Karlin, S. & Altschul, S. F. (1990) Proc. Natl. Acad. Sci. USA 87,
2264-2268.
23. Karlin, S. & Altschul, S. F. (1993) Proc. Natl. Acad. Sci. USA 90,
5873-5877.
24. Altschul, S. F., Boguski, M. S., Gish, W. & Wootton, J. C. (1994)
Nat. Genet. 6, 119-129.
25. Pearson, W. R. (1996) Methods Enzymol. 266, 227-258.
26. Lipman, D. J., Wilbur, W. J., Smith, T. F. & Waterman, M. S.
(1984) Nucleic Acids Res. 12, 215-226.
27. Wootton, J. C. & Federhen, S. (1996) Methods Enzymol. 266,
554-571.
28. Waterman, M. S. & Vingron, M. (1994) Stat. Sci. 9, 367-381.
29. Perutz, M. F., Kendrew, J. C. & Watson, H. C. (1965) J. Mol. Biol.
13, 669-678.
30. Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T. F. &
Weng, J. (1987) in Crystallographic Databases: Information Content,
Software Systems, Scientific Applications, eds. Allen, F. H.,
Bergerhoff, G. & Sievers, R. (Data Comm. Intl. Union Crystallogr.,
Cambridge, UK), pp. 107-132.
31. Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1997) Curr. Opin.
Struct. Biol. 7, 369-376.
32. Orengo, C., Michie, A., Jones, S., Jones, D. T., Swindells, M. B. &
Thornton, J. (1997) Structure (London) 5, 1093-1108.
33. Zweig, M. H. & Campbell, G. (1993) Clin. Chem. 39, 561-577.
34. Gribskov, M. & Robinson, N. L. (1996) Comput. Chem. 20, 25-33.
35. Fitch, W. M. (1966) J. Mol. Biol. 16, 9-16.
36. Chung, S. Y. & Subbiah, S. (1996) Structure (London) 4, 1123-1127.
37. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang,
Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25,
3389-3402.
38. Girling, R. L., Schmidt, W. C., Jr., Houston, T. E., Amma, E. L. &
Huisman, T. H. J. (1979) J. Mol. Biol. 131, 417-433.
39. Spezio, M., Wilson, D. B. & Karplus, P. A. (1993) Biochemistry 32,
9906-9916.
40. Sayle, R. A. & Milner-White, E. J. (1995) Trends Biochem. Sci.
20, 374-376.

Proc. Natl. Acad. Sci. USA 95 (1998)



6076 Biochemistry: Brenner et al. 




Fig. 4. Reliability of statistical scores in PDB90D-B: Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for SSEARCH and
FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers and diverge
at higher values, as indicated by the lower bold line.) E-values from
SSEARCH and FASTA are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB90D-B were similar to those for PDB40D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical
score.

ignored in previous tests but is essential for the straightforward
or automatic interpretation of sequence comparison results.
Further, it provides a clear indication of the confidence that
should be ascribed to each match. Indeed, the EPQ measure
should approximate the expectation value reported by database
searching programs, if the programs' estimates are accurate.
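
A coverage versus EPQ curve of this kind can be sketched as follows.
This is a minimal illustration rather than the authors' implementation:
it assumes that coverage is the fraction of all true homologous pairs
found and that EPQ is the number of false matches divided by the number
of queries, and every name and number in it is hypothetical.

```python
def coverage_vs_epq(scored_pairs, n_homolog_pairs, n_queries, lower_is_better=True):
    """Walk down the ranked list of (score, is_homolog) pairs and record,
    at each position, (score, coverage, errors per query)."""
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=not lower_is_better)
    curve = []
    true_found = 0
    false_found = 0
    for score, is_homolog in ranked:
        if is_homolog:
            true_found += 1
        else:
            false_found += 1
        curve.append((score, true_found / n_homolog_pairs, false_found / n_queries))
    return curve


def coverage_at_epq(curve, max_epq):
    """Highest coverage reached while the error rate stays at or below max_epq."""
    best = 0.0
    for _score, coverage, epq in curve:
        if epq <= max_epq:
            best = max(best, coverage)
    return best


if __name__ == "__main__":
    # Hypothetical E-values for five matches; True marks a real homologous pair.
    pairs = [(1e-30, True), (1e-10, True), (0.001, False), (0.01, True), (1.0, False)]
    curve = coverage_vs_epq(pairs, n_homolog_pairs=4, n_queries=100)
    print(coverage_at_epq(curve, max_epq=0.01))
```

Reading the curve at a fixed error rate, such as 1% EPQ, gives a single
number for each protocol that reflects sensitivity and specificity
together.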

The Performance of Scoring Schemes. All of the programs
tested could provide three fundamental types of scores. The
first score is the percentage identity, which may be computed
in several ways based on either the length of the alignment or
the lengths of the sequences. The second is a "raw" or
"Smith-Waterman" score, which is the measure optimized by
the Smith-Waterman algorithm and is computed by summing
the substitution matrix scores for each position in the alignment
and subtracting gap penalties. In BLAST, a measure
related to this score is scaled into bits. Third is a statistical
score based on the extreme value distribution. These results
are summarized in Fig. 1.
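
The raw score described above can be illustrated with a short sketch
that scores an alignment which has already been computed. The
substitution values in TOY_MATRIX are invented for the example and are
not BLOSUM entries, and the gap convention (a cost for the first gapped
position and a smaller cost for each further position) is an assumption
about how penalties such as -12/-1 are applied.

```python
# Hypothetical substitution scores for a handful of residue pairs.
TOY_MATRIX = {("A", "A"): 4, ("C", "C"): 9, ("D", "E"): 2, ("F", "F"): 6}


def raw_score(aligned_a, aligned_b, matrix, gap_first=12, gap_next=1):
    """Sum substitution scores over aligned columns and subtract gap
    penalties. '-' marks a gap; both strings must have the same length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gapped = None  # which sequence holds the currently open gap: "a", "b", or None
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be gapped in both sequences")
        if x == "-" or y == "-":
            current = "a" if x == "-" else "b"
            score -= gap_next if gapped == current else gap_first
            gapped = current
        else:
            # The matrix is symmetric, so look the pair up in either order.
            pair = (x, y) if (x, y) in matrix else (y, x)
            score += matrix[pair]
            gapped = None
    return score


if __name__ == "__main__":
    print(raw_score("ACD--F", "ACEGHF", TOY_MATRIX))
```

The Smith-Waterman algorithm itself searches for the local alignment
that maximizes this quantity; the sketch only evaluates one given
alignment.
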
Sequence Identity. Though it has been long established that
percentage identity is a poor measure (35), there is a common
rule-of-thumb stating that 30% identity signifies homology.
Moreover, publications have indicated that 25% identity can
be used as a threshold (17, 36). We find that these thresholds,
originally derived years ago, are not supported by present
results. As databases have grown, so have the possibilities for
chance alignments with high identity; thus, the reported cutoffs
lead to frequent errors. Fig. 2 shows one of the many pairs of
proteins with very different structures that nonetheless have
high levels of identity over considerable aligned regions.
Despite the high identity, the raw and the statistical scores for
such incorrect matches are typically not significant. The principal
reasons percentage identity does so poorly seem to be
that it ignores information about gaps and about the conservative
or radical nature of residue substitutions.

From the analysis in Fig. 3, we learn that 30%
identity is a reliable threshold for this database only for
sequence alignments of at least 150 residues. Because one
unrelated pair of proteins has 43.5% identity over 62 residues,
alignments must be more than 62 residues
in length before 40% is a reasonable threshold, for a database
of this particular size and composition.

At a given reliability, scores based on percentage identity
detect just a fraction of the distant homologs found by
statistical scoring. If one measures the percentage identity in
the aligned regions without consideration of alignment length,
then a negligible number of distant homologs are detected.
Use of the HSSP equation improves the value of percentage
identity, but even this measure can find only 4% of all known
homologs at 1% EPQ. In short, percentage identity discards
most of the information measured in a sequence comparison.
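
Because percentage identity may be normalized by the alignment length
or by the sequence lengths, the same alignment can yield different
values. The sketch below is illustrative only: the dictionary keys are
hypothetical names, and for simplicity it takes the sequence lengths
from the aligned residues, whereas a local alignment would normally be
compared against the full sequence lengths.

```python
def percent_identity(aligned_a, aligned_b):
    """Return percentage identity under three normalizations.
    '-' marks a gap; both strings must have the same length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    len_a = len(aligned_a.replace("-", ""))
    len_b = len(aligned_b.replace("-", ""))
    return {
        "alignment_length": 100.0 * matches / len(aligned_a),
        "shorter_sequence": 100.0 * matches / min(len_a, len_b),
        "longer_sequence": 100.0 * matches / max(len_a, len_b),
    }


if __name__ == "__main__":
    for name, value in percent_identity("ACD--F", "ACEGHF").items():
        print(f"{name}: {value:.1f}%")
```

None of these variants uses the substitution or gap information, which
is the loss of information noted above.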

Raw Scores. Smith-Waterman raw scores perform better
than percentage identity (Fig. 1), but ln-scaling (7) provided no
notable benefit in our analysis. It is necessary to be very precise
when using either raw or bit scores because a 20% change in
cutoff score could yield a tenfold difference in EPQ. However,
it is difficult to choose appropriate thresholds because the
reliability of a bit score depends on the lengths of the proteins
matched and the size of the database. Raw score thresholds
also are affected by matrix and gap parameters.

Statistical Scores. Statistical scores were introduced partly
to overcome the problems that arise from raw scores. This
scoring scheme provides the best discrimination between
homologous proteins and those which are unrelated. Most
superfamilies. Pearson found that modern matrices and "ln-scaling"
of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked
slightly better than fasta, which was in turn more effective 
than blast. 

Very large scale analyses of matrices have been performed 
(10), and Henikoff and Henikoff (11) also evaluated the 
effectiveness of blast and fasta. Their test with blast 
considered the ability to detect homologs above a predeter- 
mined score but had no penalty for methods which also 
reported large numbers of spurious matches. The Henikoffs 
searched the swiss-prot database (12) and used prosite (13) 
to define homologous families. Their results showed that the 
BLOSUM62 matrix (14) performed markedly better than the 
extrapolated PAM-series matrices (15), which previously had 
been popular. 

A crucial aspect of any assessment is the data that are used 
to test the ability of the program to find homologs. But in 
Pearson's and the Henikoffs' evaluations of sequence comparison,
the correct results were effectively unknown. This is
because the superfamilies in PIR and PROSITE are principally
created by using the same sequence comparison methods 
which are being evaluated. Interdependency of data and 
methods creates a "chicken and egg" problem and means, for
example, that new methods would be penalized for correctly 
identifying homologs missed by older programs. For instance, 
immunoglobulin variable and constant domains are clearly 
homologous, but pir places them in different superfamilies. 
The problem is widespread: each superfamily in pir 48.00 with 
a structural homolog is itself homologous to an average of 1.6 
other pir superfamilies (16). 

To surmount these sorts of difficulties, Sander and Schnei- 
der (17) used protein structures to evaluate sequence com- 
parison. Rather than comparing different sequence compari- 
son algorithms, their work focused on determining a length- 
dependent threshold of percentage identity, above which all 
proteins would be of similar structure. A result of this analysis 
was the hssp equation; it states that proteins with 25% identity 
over 80 residues will have similar structures, whereas shorter 
alignments require higher identity. (Other studies also have 
used structures (18-20), but these focused on a small number 
of model proteins and were principally oriented toward eval- 
uating alignment accuracy rather than homology detection.) 
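
A length-dependent identity threshold of this kind can be sketched as a
simple function. The constants below are the commonly quoted form of
the Sander and Schneider curve, not values given in this text, so they
are an assumption to be checked against ref. 17; the function names are
hypothetical.

```python
def hssp_threshold(alignment_length: int) -> float:
    """Percentage identity above which an alignment of this length is taken
    to imply similar structure (commonly quoted form; see ref. 17)."""
    if alignment_length < 10:
        raise ValueError("the curve is not defined for very short alignments")
    if alignment_length > 80:
        return 24.8
    return 290.15 * alignment_length ** -0.562


def passes_hssp(identity_percent: float, alignment_length: int) -> bool:
    """True if the alignment lies on or above the threshold curve."""
    return identity_percent >= hssp_threshold(alignment_length)


if __name__ == "__main__":
    for length in (20, 40, 80, 150):
        print(length, round(hssp_threshold(length), 1))
```

The curve falls steeply for short alignments and flattens near 25% at
about 80 residues, which matches the behavior described above.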

A general solution to the problem of scoring comes from 
statistical measures (i.e., E-values and P-values) based on the 
extreme value distribution (21). Extreme value scoring was 
implemented analytically in the blast program using the 
Karlin and Altschul statistics (22, 23) and empirical ap- 
proaches have been recently added to fasta and ssearch. In 
addition to being heralded as a reliable means of recognizing 
significantly similar proteins (24, 25), the mathematical tractability
of statistical scores "is a crucial feature of the BLAST
algorithm" (1). The validity of this scoring procedure has been
tested analytically and empirically (see ref. 2 and references in 
ref. 24). However, all large empirical tests used random 
sequences that may lack the subtle structure found within 
biological sequences (26, 27) and obviously do not contain any 
real homologs. Thus, although many researchers have sug- 
gested that statistical scores be used to rank matches (24, 25, 
28), there have been no large rigorous experiments on biological
data to determine the degree to which such rankings are
superior. 

A Database for Testing Homology Detection. Since the 
discovery that the structures of hemoglobin and myoglobin are 
very similar though their sequences are not (29), it has been 
apparent that comparing structures is a more powerful (if less 
convenient) way to recognize distant evolutionary relation- 
ships than comparing sequences. If two proteins show a high 
degree of similarity in their structural details and function, it
is very probable that they have an evolutionary relationship
though their sequence similarity may be low.

The recent growth of protein structure information combined
with the comprehensive evolutionary classification in
the SCOP database (4, 5) have allowed us to overcome previous
limitations. With these data, we can evaluate the performance 
of sequence comparison methods on real protein sequences 
whose relationships are known confidently. The scop database 
uses structural information to recognize distant homologs, the 
large majority of which can be determined unambiguously. 
These superfamilies, such as the globins or the immunoglobu- 
lins, would be recognized as related by the vast majority of the 
biological community despite the lack of high sequence sim- 
ilarity. 

From scop, we extracted the sequences of domains of 
proteins in the Protein Data Bank (PDB) (30) and created two 
databases. One (PDB90D-B) has domains that were all <90%
identical to any other, whereas the other (PDB40D-B) had those <40%
identical. The databases were created by first sorting all 
protein domains in scop by their quality and making a list. The 
highest quality domain was selected for inclusion in the 
database and removed from the list. Also removed from the list 
(and discarded) were all other domains above the threshold 
level of identity to the selected domain. This process was 
repeated until the list was empty. The PDB40D-B database 
contains 1,323 domains, which have 9,044 ordered pairs of
distant relationships, or ~0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988 relationships,
representing 1.2% of all pairs. Low complexity regions
of sequence can achieve spurious high scores, so these were 
masked in both databases by processing with the SEG program 
(27) using recommended parameters: 12 1.8 2.0. The databases 
used in this paper are available from http://sss.stanford.edu/sss/,
and databases derived from the current version of SCOP
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/. 
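
The selection procedure described above is a greedy filter, and the
pair counts follow from simple arithmetic. The sketch below illustrates
both under stated assumptions: the quality scores and identity values
are toy data, and the quality and identity functions stand in for
whatever measures were actually used, which are not specified here.

```python
def select_representatives(domains, quality, identity, threshold):
    """Repeatedly keep the best remaining domain and discard every other
    domain whose identity to it is at or above the threshold."""
    remaining = sorted(domains, key=quality, reverse=True)
    selected = []
    while remaining:
        best = remaining.pop(0)
        selected.append(best)
        remaining = [d for d in remaining if identity(best, d) < threshold]
    return selected


def ordered_pairs(n_domains: int) -> int:
    """Number of ordered (query, target) pairs among n_domains sequences."""
    return n_domains * (n_domains - 1)


# Toy data: hypothetical quality scores and pairwise percentage identities.
TOY_QUALITY = {"a": 4, "b": 3, "c": 2, "d": 1}
TOY_IDENTITY = {
    frozenset("ab"): 95, frozenset("ac"): 30, frozenset("ad"): 20,
    frozenset("bc"): 50, frozenset("bd"): 10, frozenset("cd"): 45,
}


def toy_identity(x, y):
    return TOY_IDENTITY[frozenset((x, y))]


if __name__ == "__main__":
    print(select_representatives("abcd", TOY_QUALITY.get, toy_identity, 40))
    print(select_representatives("abcd", TOY_QUALITY.get, toy_identity, 90))
    print(ordered_pairs(1323), 100 * 9044 / ordered_pairs(1323))
```

With 1,323 domains the function gives the 1,749,006 ordered pairs
quoted in the text, of which the 9,044 related pairs are about 0.5%.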

Analyses from both databases were generally consistent, but 
PDB40D-B focuses on distantly related proteins and reduces the 
heavy overrepresentation in the PDB of a small number of 
families (31, 32), whereas PDB90D-B (with more sequences)
improves evaluations of statistics. Except where noted other- 
wise, the distant homolog results here are from PDB40D-B. 
Although the precise numbers reported here are specific to the 
structural domain databases used, we expect the trends to be 
general. 

Assessment Data and Procedure. Our assessment of se- 
quence comparison may be divided into four different major 
categories of tests. First, using just a single sequence compar- 
ison algorithm at a time, we evaluated the effectiveness of 
different scoring schemes. Second, we assessed the reliability 
of scoring procedures, including an evaluation of the validity 
of statistical scoring. Third, we compared sequence compari- 
son algorithms (using the optimal scoring scheme) to deter- 
mine their relative performance. Fourth, we examined the 
distribution of homologs and considered the power of pairwise 
sequence comparison to recognize them. All of the analyses 
used the databases of structurally identified homologs and a 
new assessment criterion. 

The analyses tested BLAST (1), version 1.4.9MP, and
WU-BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix
(BLOSUM62) were used for BLAST and WU-BLAST2.

The "Coverage Vs. Error" Plot. To test a particular protocol 
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database.
This yielded ordered pairs of query and target sequences with
associated scores, which were sorted, on the basis of their
scores, from best to worst. The ideal method would have



